Synthesis plan
We will conduct an individual participant data meta-analysis using a Bayesian beta-binomial hierarchical regression. The effect size will be presented as an odds ratio between the baseline and intervention phases, which captures the trend and variability aspects of SCED analysis according to WWC guidelines. The analyses will be performed in R (R Core Team, 2023; version 4.3.1) using the brms package (Bürkner, 2017), an R interface to the probabilistic programming language Stan (Stan Development Team, 2024).
We will conduct separate analyses for each category of reading intervention (phonological awareness, fluency, and phonics), as we do not consider them conceptually homogeneous enough to aggregate in one model. Likewise, we will group the analyses by designs that can be synthesized in the same model. With four design groups and three intervention categories, we anticipate at most 12 main meta-analytic models, provided that we find at least two eligible studies for each combination. Table 2 illustrates the possible combinations and the designs we plan to synthesise together.
Table 2
Study design and intervention matrix
| Design group | Designs included | Intervention categories |
|---|---|---|
| 1. AB and Multiple Baseline Designs | AB Design; Multiple Baseline across participants; Multiple Baseline across settings/situations; Multiple Baseline across time; Multiple Probe Design | A) Phonological Awareness; B) Fluency; C) Phonics |
| 2. Changing Criterion Designs | Changing Criterion (CC) Design | A) Phonological Awareness; B) Fluency; C) Phonics |
| 3. Multiple-Treatment Designs | Multi-element Design; Alternating-Treatment Design (ATD); Adapted Alternating-Treatment Design; Simultaneous-Treatment Design | A) Phonological Awareness; B) Fluency; C) Phonics |
| 4. Reversal (ABAB) Designs | ABAB (Reversal/Withdrawal) Design | A) Phonological Awareness; B) Fluency; C) Phonics |
| Combination Designs / Other | NA | Not applicable |
In the first group, we will aggregate multiple baseline designs (and their variants), and multiple probe designs. We consider these designs similar in logic, and as long as the intervention and outcome are kept homogeneous, we can model the variations in measurement and intervention setup across stages as random variation.
In the second group, we will synthesise changing-criterion (CC) designs. We will model CC designs with the same formula as multiple baseline/AB designs, but in a separate model. This follows the analytical approach suggested by W. R. Shadish, Kyse, and Rindskopf (2013), in which all criterion phases are regarded as the intervention phase and compared to the baseline (ignoring the stabilizing phases between criterion phases). Although modelling the CC design this way makes it functionally comparable to the multiple baseline/AB design setup, the staggered nature of CC designs might artificially flatten the slope compared to AB or multiple baseline designs, where mastery of the final desired outcome may come immediately, or at least early in the intervention phase.
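The recoding described above can be sketched in a few lines; the following Python snippet is purely illustrative (the phase labels and function name are ours, not part of the actual extraction pipeline):

```python
# Sketch: collapse changing-criterion (CC) phases into a two-phase
# baseline/intervention scheme, dropping stabilizing phases.
# Phase labels ("A", "C1", "stab", ...) are hypothetical examples.

def recode_cc_phase(phase):
    """Map a CC phase label to 'Baseline', 'Intervention', or None (dropped)."""
    if phase == "A":               # pre-intervention baseline
        return "Baseline"
    if phase.startswith("C"):      # any criterion phase (C1, C2, ...)
        return "Intervention"
    return None                    # stabilizing phases are ignored

sessions = ["A", "A", "C1", "stab", "C2", "C3"]
recoded = [recode_cc_phase(p) for p in sessions]
print(recoded)
# ['Baseline', 'Baseline', 'Intervention', None, 'Intervention', 'Intervention']
```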
The third group will consist of comparative study designs (Ledford & Gast, 2018). For studies using an ATD, we will follow a structured extraction workflow. First, we will verify that a baseline phase is present, either as a pre-intervention phase or as a comparison phase without treatment. If no baseline is available, the study will be excluded from extraction. If a baseline is present, we will assess the treatment phases: if the treatments do not belong to the same intervention category, the study will not be included in the analysis. Studies without a baseline/no-treatment condition, or studies comparing different intervention categories, would not answer our research question, which does not focus on the relative effectiveness of interventions. The baseline data will be explicitly modeled as “Baseline,” and the superior treatment data will be modeled as “Intervention,” extracting all available points. In all cases, data extraction will be done sequentially to maintain the chronological order of sessions in the dataset. These analyses follow the modelling alternatives suggested by Manolov, Onghena, and Noortgate (2023) and Moeyaert et al. (2015), based on decision trees driven by the research question.
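The eligibility rules in this workflow amount to a simple decision function; a minimal Python sketch (function and argument names are ours, for illustration only):

```python
# Sketch of the ATD eligibility rules described above.
# `has_baseline` covers both a pre-intervention phase and a
# no-treatment comparison phase; `categories` lists the intervention
# category of each treatment arm. Names are illustrative.

def atd_eligible(has_baseline, categories):
    """Return True if an ATD study can enter the synthesis."""
    if not has_baseline:
        return False                    # no baseline -> excluded
    return len(set(categories)) == 1    # arms must share one category

print(atd_eligible(True, ["fluency", "fluency"]))    # True
print(atd_eligible(True, ["fluency", "phonics"]))    # False
print(atd_eligible(False, ["fluency", "fluency"]))   # False
```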
If we happen to find alternating/reversal phase designs (i.e., ABAB designs), we will aggregate them in the fourth group, although due to the nature of the intervention (learning effects), we do not expect to find studies with these designs.
We will not model interventions categorized as “other,” as we expect that category to be overly heterogeneous.
We will model the baseline and intervention phases separately in our main meta-analysis model, with an interaction with time (session). We will not model autocorrelation in the main model. Although the literature suggests moderate autocorrelation, particularly in multiple-baseline designs (William R. Shadish & Sullivan, 2011), we will model it only in an exploratory analysis, as reliable estimates of its magnitude for our target data are unavailable.
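If we do explore autocorrelation, the lag-1 autocorrelation of a within-phase series can be computed directly; a minimal numpy sketch (toy values, not actual study data):

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D series (toy implementation)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

# A strictly increasing toy series shows strong positive
# lag-1 autocorrelation, as expected under a learning trend.
print(round(lag1_autocorr([1, 2, 3, 4, 5, 6]), 2))  # 0.5
```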
To answer our research question, we will provide posterior distribution plots of the average intervention effect (fixed effect) and the interaction effect between session and intervention. Furthermore, we will plot the posterior distributions of the random-effects estimates (SD) to assess heterogeneity of intervention effects between studies, and between participants within studies. We will provide estimates with 90% credible intervals for each effect, allowing readers to understand both the average intervention impact and the variability of effects across different studies and participants. As estimates will be presented on a log-odds scale, we will supplement them with converted odds ratios for interpretability.
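The conversion from the log-odds scale to odds ratios is a direct exponentiation; a short sketch with hypothetical estimate values:

```python
import math

# Hypothetical log-odds estimate with 90% credible interval bounds
log_or, lower, upper = 1.2, 0.4, 2.0

# Odds ratios are obtained by exponentiating; the interval endpoints
# transform directly because exp() is monotone.
or_est = math.exp(log_or)
print(f"OR = {or_est:.2f} [{math.exp(lower):.2f}, {math.exp(upper):.2f}]")
```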
Level one of the multilevel model describes the relationship between the outcome (word or non-word reading) and the predictor (reading intervention). Informed by previous reviews with similar interests, we assume the majority of studies will employ non-standardized word-reading tests, with counts and percentages as measures, and fixed word lists. We assume the existence of autocorrelation (though we will not model it explicitly) and of knowledge transfer; therefore, we cannot assume the trials are independent. To allow for greater flexibility, to account for over-dispersion (i.e., variance larger than the mean), and to handle instances where studies report only proportions of correct responses rather than raw counts, we will model the outcome under the beta-binomial distribution (Gelman et al., 2020). Figure 1 illustrates the multiple levels in a meta-analysis.
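The over-dispersion point can be verified numerically: for the same number of trials and mean probability, the beta-binomial variance exceeds the binomial variance. A scipy sketch with illustrative parameter values:

```python
from scipy.stats import betabinom, binom

n, mu, phi = 20, 0.6, 5.0           # trials, mean probability, precision
a, b = mu * phi, (1 - mu) * phi     # beta parameters implied by (mu, phi)

print("binomial var:     ", binom.var(n, mu))        # 4.8
print("beta-binomial var:", betabinom.var(n, a, b))  # 20.0
# Both have mean n * mu, but the beta-binomial variance is inflated
# by the factor (n + phi) / (1 + phi).
```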
To calculate the effect for our meta-analysis, we need to account for participant and study clusters. There are two ways to approach the raw data: calculate average effects per study and meta-analyse those effects (the common approach when raw data are unavailable), or build a model that estimates the population (average) effect directly from the raw data. We will do the latter, the one-stage approach.
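The distinction matters because the two approaches weight studies differently; a toy illustration with two hypothetical studies of very unequal size:

```python
# Two hypothetical studies: (correct responses, trials) pooled over sessions.
studies = [(90, 100), (1, 10)]

# Two-stage flavour: average the per-study proportions
# (each study weighted equally, regardless of size).
two_stage = sum(c / n for c, n in studies) / len(studies)

# One-stage flavour: estimate directly from the raw counts
# (here simply pooling; the actual model adds hierarchical structure).
one_stage = sum(c for c, _ in studies) / sum(n for _, n in studies)

print(round(two_stage, 3), round(one_stage, 3))  # 0.5 0.827
```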
Figure 1 
Thus, our starting model is as follows:
\[
Y_{ijk} \sim BetaBin(N_{ijk}, \mu_{ijk}, \theta)
\]
The beta-binomial distribution can be understood as a binomial distribution whose probability parameter is itself drawn from a beta distribution. \(Y_{ijk}\) represents the number of correct responses for each individual (j) within a study (k) at a certain time point i = 1, …, n. \(BetaBin\) indicates the observations are drawn from a beta-binomial distribution with the \(N_{ijk}\) parameter for the number of trials, \(\mu_{ijk}\) for the average probability of the beta distribution (some authors denote it as pbar; McElreath, 2020), and \(\theta\) as the dispersion (precision) parameter of the beta distribution (some authors denote it as \(\phi\); Bürkner, 2017).
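Under the common mean-precision parameterization, \((\mu, \theta)\) maps onto the conventional beta shape parameters as \(a = \mu\theta\), \(b = (1 - \mu)\theta\); a quick scipy check with illustrative values:

```python
from scipy.stats import betabinom

n, mu, theta = 10, 0.3, 4.0
a, b = mu * theta, (1 - mu) * theta   # shape parameters of the beta prior

# The mean of the beta-binomial is n * mu regardless of theta;
# theta only controls how concentrated the success probability is.
print(betabinom.mean(n, a, b))   # 3.0
```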
\[
\text{logit}(\mu_{ijk}) = \alpha + \beta_1 ID_{j} + \beta_2 \text{session}_{i} + \beta_3 ID_j \times \text{session}_{i} + g_{jk} + g_k
\]
Here, \(\alpha\) is the global intercept, \(\beta_1\) represents the main effect of ID (i.e., phase), \(\beta_2\) represents the main effect of session (i.e., measurement occasions), \(\beta_3 ID_j \times \text{session}_i\) represents the interaction of session and phase, \(g_{jk}\) is the participant-level random effect nested within the study, and \(g_k\) is the study-level random effect. The ID variable stands for phase. In the case of AB, multiple baseline, and changing criterion designs, phase will represent both baseline and intervention. In eligible alternating treatment designs, it will represent baseline, intervention 1, and intervention 2.
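On the response scale, the linear predictor is pushed through the inverse logit; a sketch with entirely hypothetical coefficient values shows how phase and session combine:

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical fixed effects on the log-odds scale: intercept, phase,
# session slope, and phase-by-session interaction.
alpha, b1, b2, b3 = -1.0, 1.5, 0.05, 0.10

def mu(phase, session):
    """Expected success probability for a given phase (0/1) and session."""
    return inv_logit(alpha + b1 * phase + b2 * session + b3 * phase * session)

print(round(mu(0, 5), 3))  # baseline probability at session 5
print(round(mu(1, 5), 3))  # intervention probability at session 5
```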
The multilevel model now specifies random effects on the participant (\(g_{jk}\)) and study (\(g_{k}\)) levels. We also define parameters for the priors on the means and standard deviations. As these effects are interdependent, the model also produces a correlation matrix to explain the covariance. Weakly informative priors drawn from a Student-t or a Cauchy distribution have been recommended by Gelman (2006) for hierarchical models, particularly when the number of groups is larger than 3, which we expect in our meta-analyses. Suggested informative priors come from Grekov, Pustejovsky, and Klingbeil (2025). However, their specified priors fit the baseline phase only, while we also include the intervention phase. Nevertheless, their data are highly similar to our target data, as they modelled reading fluency outcomes from single-case designs.
Furthermore, we will incorporate immediate level changes and time trends through session effects and their interaction with the phases. This allows us to capture both the immediate difference and gradual changes happening during the intervention/baseline phase. The model includes random slopes for both session and intervention × session effects at both study and case levels, enabling us to detect heterogeneity in learning trajectories across studies and participants.
The LKJ (Lewandowski-Kurowicka-Joe) distribution is a probability distribution over positive-definite symmetric matrices with unit diagonals, i.e., correlation matrices (Gelman et al., 2020). We use this distribution as a prior for the correlation matrix of effects in the model and set it as \(LKJcorr(2)\), which makes extreme correlations less likely while still allowing for correlations between the effects (McElreath, 2020).
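For a 2×2 correlation matrix, the LKJ density is proportional to \((1 - r^2)^{\eta - 1}\), so \(\eta = 2\) downweights extreme correlations without excluding them; a quick numerical check:

```python
# Unnormalized LKJ(eta) density for a 2x2 correlation matrix with
# off-diagonal entry r: proportional to det(R)^(eta-1) = (1 - r^2)^(eta-1).
def lkj_density_2x2(r, eta=2.0):
    return (1.0 - r * r) ** (eta - 1.0)

for r in (0.0, 0.5, 0.9):
    print(r, lkj_density_2x2(r))
# r = 0 is most likely; |r| near 1 is downweighted but still possible.
```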
We draw standard deviations from half-Student-t distributions instead of normal distributions, as the heavier tails allow for extreme values, which helps us accommodate the uncertainty about these parameters. Defining the distribution as “half” constrains it to positive values, as required for an SD.
Given that the designs bound the possible responses to a small range, and that interventions are often continued until a 100% success rate is achieved, we can expect ceiling effects, large effects in the intervention phase, and larger improvements (steeper slopes) for participants with lower baseline results.
The hierarchical structure of the random effects follows:
\[
g_{jk} \sim \mathcal{N}(0, \sigma_{j[k]})
\]
\[
g_k \sim \mathcal{N}(0, \sigma_{k})
\]
where \(g_{jk}\) represents variability among cases within studies, \(g_{k}\) stands for variability across studies, \(\sigma_{j[k]}\) and \(\sigma_{k}\) are the standard deviations at each level.
The variance components follow a hierarchical prior:
\[
\sigma_{k} \sim \text{Half-Student}(3, 0, 2.5)
\]
\[
\sigma_{j[k]} \sim \text{Half-Student}(3, 0, 2.5)
\]
\[
\mathbf{R} \sim LKJcorr(2)
\]
\[
\alpha \sim \text{Student}(3, 3.9, 2.5)
\]
\[
\beta_1, \beta_2, \beta_3 \sim \text{Student}(3, 0, 2.5)
\]
\[
\theta \sim \text{Gamma}(1, 1)
\]
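As a sanity check on these priors, one can draw samples and inspect their ranges; the following scipy sketch obtains half-Student-t draws by folding Student-t draws to positive values (an illustrative prior-predictive check, not part of the model fitting):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Half-Student-t(3, 0, 2.5) prior on the SDs: sample Student-t, take abs.
sigma = np.abs(stats.t.rvs(df=3, loc=0, scale=2.5, size=10_000, random_state=rng))

# Gamma(1, 1) prior on the beta-binomial precision parameter.
theta = stats.gamma.rvs(a=1, scale=1, size=10_000, random_state=rng)

print("all SDs positive?", bool(np.all(sigma >= 0)))
print("median SD draw:   ", round(float(np.median(sigma)), 2))
print("median theta draw:", round(float(np.median(theta)), 2))
```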
Moderator analysis
We will conduct a moderator analysis based on IQ, as findings by Allor et al. (2014) suggest that lower IQ is linked to lower rates of reading development.
We will introduce IQ as a continuous predictor in the main model, centered around the group (study) mean. While grand-mean centering might be appropriate given that it is one population of participants, group-mean centering is preferable here due to the likely non-random selection of intellectual disability severity within studies (Kreft, de Leeuw, & Aiken, 1995; Enders & Tofighi, 2007). Specifically, the selection of participants with a particular severity level (e.g., mild intellectual disability) may influence the intervention complexity. If IQ were centered around the grand mean, this study-level selection bias could distort estimates, making group-mean centering the more appropriate choice to account for within-study variability.
\[
\text{logit}(\mu_{ijk}) = \alpha + \beta_1 ID_{j} + \beta_2 \text{session}_{i} + \beta_3 ID_j \times \text{session}_{i} + g_{jk} + g_k + \gamma_{IQ[j]}
\] \[
\gamma_{IQ[j]} \sim \mathcal{N}(\mu_{IQ[j]}, \sigma_{IQ[j]})
\]
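Group-mean (study-mean) centering of IQ can be sketched in a few lines of Python (study labels and IQ values are hypothetical):

```python
# Hypothetical IQ scores grouped by study; centering within each study
# removes study-level selection differences from the predictor.
iq_by_study = {"study1": [55, 60, 65], "study2": [70, 75, 80]}

centered = {
    study: [x - sum(scores) / len(scores) for x in scores]
    for study, scores in iq_by_study.items()
}
print(centered)
# Each study's centered scores sum to zero, so between-study
# differences in mean IQ no longer enter the slope estimate.
```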
We will include IQ as a continuous variable rather than using categorical intellectual disability severity, as a continuous measure provides more granular and informative modeling of variation. Additionally, there are potentially two types of missing values:
- Missing at random (MAR): IQ scores are either not reported or not permitted for sharing.
- Missing not at random (MNAR): IQ scores are unavailable because participants could not be reliably tested due to the severity of their disability.
If the data are missing not at random, we will not conduct the moderator analysis. In the case of MAR missingness, we will impute values with the mice package (van Buuren & Groothuis-Oudshoorn, 2011).
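mice's default method for continuous variables is predictive mean matching (PMM); the following single-variable numpy sketch illustrates the idea only (it is not the mice algorithm, and all data values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor and IQ scores with one missing value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
iq = np.array([55.0, 58.0, np.nan, 66.0, 70.0])

obs = ~np.isnan(iq)
# 1) Fit a simple regression of IQ on x using the observed cases.
slope, intercept = np.polyfit(x[obs], iq[obs], 1)
pred = intercept + slope * x

# 2) For each missing case, find the observed donors whose predicted
#    values are closest, and sample one of their *observed* IQs.
for i in np.where(~obs)[0]:
    dist = np.abs(pred[obs] - pred[i])
    donors = np.argsort(dist)[:2]          # two nearest donors
    iq[i] = rng.choice(iq[obs][donors])

print(iq)  # the NaN is replaced by a plausible observed value
```

Because imputed values are always drawn from observed donor values, PMM avoids implausible imputations (e.g., IQ scores outside the observed range).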